LLM 25-Day Course - Day 2: NLP Fundamentals and Terminology

To understand LLMs, you first need to know NLP (Natural Language Processing) terminology. Today we’ll organize the NLP terms that appear most frequently in LLM papers and documentation.

NLP Core Terminology

Token: the smallest unit of text a model processes. Example: "Hello world" -> ["Hello", " world"]
Corpus: a collection of text data used for training. Example: all of Wikipedia, a collection of news articles
Vocabulary: the set of all tokens the model knows. Example: GPT-4's vocabulary size is roughly 100,000 tokens
Embedding: a word converted into a numeric vector. Example: "king" -> [0.2, -0.5, 0.8, ...]
Sequence: an ordered arrangement of tokens. Example: a single sentence or paragraph
Attention: a mechanism that focuses on the important parts of the input. Example: in "He ate the apple," determining who "he" refers to
Encoding: converting input into an internal representation. Example: sentence -> vector
Decoding: converting an internal representation into output. Example: vector -> sentence
Perplexity: a metric of model prediction uncertainty (lower is better). Example: PPL = 15 means roughly 15 equally likely candidates for the next word
Context Window: the number of tokens a model can process at once. Example: modern models support tens to hundreds of thousands of tokens
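
The vocabulary and encoding entries above can be illustrated with a toy example. The vocabulary here is made up for illustration; real vocabularies are learned from the corpus and are far larger.

```python
# Toy vocabulary: a mapping from tokens to integer ids.
# (Made-up entries for illustration; real vocabularies hold ~100k tokens.)
vocab = {"<unk>": 0, "Hello": 1, "world": 2, "!": 3}

def encode(tokens, vocab):
    # Tokens not in the vocabulary fall back to the <unk> id.
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

ids = encode(["Hello", "world", "!"], vocab)   # [1, 2, 3]
ids_unk = encode(["Hello", "there"], vocab)    # [1, 0] - "there" is unknown
```

Subword tokenization (covered below) largely avoids the unknown-token problem, because any string can be broken into pieces that are in the vocabulary.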

Basic Tokenization Concept

# Simplest tokenization: splitting by whitespace
sentence = "Natural language processing is really fascinating"
tokens_simple = sentence.split()
print(tokens_simple)
# ['Natural', 'language', 'processing', 'is', 'really', 'fascinating']

# Real LLMs use subword tokenization
# "fascinating" -> ["fasc", "inating"] — split into smaller pieces
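
One common subword scheme is byte-pair encoding (BPE). A minimal sketch of a single BPE merge step, on a tiny made-up corpus (not a production tokenizer):

```python
from collections import Counter

# One BPE training step: find the most frequent adjacent symbol pair
# across the corpus and merge it into a single symbol.
def most_frequent_pair(words):
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # merge the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Words start as character sequences
corpus = [list("lower"), list("lowest"), list("newer"), list("newest")]
pair = most_frequent_pair(corpus)      # ('w', 'e') appears 4 times
corpus = merge_pair(corpus, pair)      # 'w' and 'e' become one symbol 'we'
```

Repeating this step thousands of times builds a vocabulary of frequent subwords, which is why a rare word like "fascinating" ends up split into familiar pieces.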

Intuitive Understanding of Embeddings

import numpy as np

# Embeddings: representing words as numeric vectors
# Words with similar meanings are located close together in vector space
embeddings = {
    "king":   np.array([0.8, 0.2, -0.5, 0.9]),
    "queen":  np.array([0.7, 0.3, -0.4, 0.85]),
    "apple":  np.array([-0.2, 0.9, 0.6, -0.1]),
}

# Measuring similarity between words using cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"king-queen similarity: {cosine_similarity(embeddings['king'], embeddings['queen']):.3f}")
print(f"king-apple similarity: {cosine_similarity(embeddings['king'], embeddings['apple']):.3f}")
# king-queen: high similarity / king-apple: low similarity
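
Building on this, a nearest-neighbor lookup simply picks the word with the highest cosine similarity. The vectors below are the same made-up four-dimensional examples, so the result is illustrative only:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings (made-up values for illustration)
embeddings = {
    "king":  np.array([0.8, 0.2, -0.5, 0.9]),
    "queen": np.array([0.7, 0.3, -0.4, 0.85]),
    "apple": np.array([-0.2, 0.9, 0.6, -0.1]),
}

def nearest_word(word, embeddings):
    # The most similar *other* word by cosine similarity
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))

print(nearest_word("king", embeddings))  # queen
```

With real embeddings this same lookup powers semantic search: the query is embedded, and the closest stored vectors are returned.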

Perplexity Calculation Example

import numpy as np

# Perplexity: how well a model predicts the next word
# PPL = exp(average cross-entropy loss)
def calculate_perplexity(loss):
    return np.exp(loss)

good_model_loss = 2.7    # Well-trained model
bad_model_loss = 5.5     # Poorly trained model

print(f"Good model PPL: {calculate_perplexity(good_model_loss):.1f}")
print(f"Bad model PPL: {calculate_perplexity(bad_model_loss):.1f}")
# Lower PPL means better next-word prediction
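
The loss itself comes from the probabilities the model assigns to the correct next token. A sketch of the full chain, with made-up probabilities for illustration:

```python
import numpy as np

# Hypothetical probabilities the model assigned to the *correct*
# next token at each of four positions (made-up numbers)
token_probs = np.array([0.25, 0.10, 0.50, 0.05])

# Average cross-entropy loss: mean of -log p(correct token)
loss = -np.mean(np.log(token_probs))
ppl = np.exp(loss)

# PPL is the geometric mean of 1/p over the sequence:
# the model behaves as if choosing among ~ppl equally likely candidates
print(round(ppl, 2))  # 6.32
```

Note how the one very unlikely token (p = 0.05) pulls the perplexity well above what the confident predictions alone would give.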

NLP terminology cannot be memorized in a single day. Use this table as a reference as we dive deeper into each concept in the days ahead.

Today’s Exercises

  1. Tokenize the sentence “Artificial intelligence is changing the world” by whitespace, by syllable, and by meaning. Explain the differences.
  2. Summarize the pros and cons of larger embedding vector dimensions. Compare Word2Vec (300 dimensions) with modern large model embeddings (higher dimensions).
  3. Think about what it means if Perplexity equals 1, and whether a model with PPL=1 is achievable in practice.
